{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "K8HAOWyDgPiH"
   },
   "source": [
    "# Feature based methods"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "v6w1KA97gXao"
   },
   "source": [
    "In this notebook we will exploring a very naive (yet powerful) approach for solving graph-based supervised machine learning. The idea rely on the classic machine learning approach of handcrafted feature extraction.\n",
    "\n",
    "In Chapter 1 you learned how local and global graph properties can be extracted from graphs. Those properties represent the graph itself and bring important informations which can be useful for classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "yWL_AuChPcYS"
   },
   "source": [
    "In this demo, we will be using the PROTEINS dataset, already integrated in StellarGraph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "executionInfo": {
     "elapsed": 12513,
     "status": "ok",
     "timestamp": 1611482476990,
     "user": {
      "displayName": "Aldo Marzullo",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GjBD_mZewcZ8LCqkD20Nku4DR5OCGFqYkxawoUjgg=s64",
      "userId": "17245895923239449231"
     },
     "user_tz": -60
    },
    "id": "gS5B47T2gWll",
    "outputId": "5b39bfb4-cc26-4a43-a20b-1cc657584e4b"
   },
   "outputs": [],
   "source": [
    "from stellargraph import datasets\n",
    "from IPython.display import display, HTML\n",
    "\n",
    "dataset = datasets.PROTEINS()\n",
    "display(HTML(dataset.description))\n",
    "graphs, graph_labels = dataset.load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "YDlUMUUFLrjh"
   },
   "source": [
    "To compute the graph metrics, one way is to retrieve the adjacency matrix representation of each graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "qsOw9zFwrxDe"
   },
   "outputs": [],
   "source": [
    "# convert graphs from StellarGraph format to numpy adj matrices\n",
    "adjs = [graph.to_adjacency_matrix().A for graph in graphs]\n",
    "# convert labes fom Pandas.Series to numpy array\n",
    "labels = graph_labels.to_numpy(dtype=int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6S5M5mL2t-ik"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import networkx as nx\n",
    "\n",
    "metrics = []\n",
    "for adj in adjs:\n",
    "  G = nx.from_numpy_matrix(adj)\n",
    "  # basic properties\n",
    "  num_edges = G.number_of_edges()\n",
    "  # clustering measures\n",
    "  cc = nx.average_clustering(G)\n",
    "  # measure of efficiency\n",
    "  eff = nx.global_efficiency(G)\n",
    "\n",
    "  metrics.append([num_edges, cc, eff])\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "1_a5CiZKL4vW"
   },
   "source": [
    "We can now exploit scikit-learn utilities to create a train and test set. In our experiments, we will be using 70% of the dataset as training set and the remaining as testset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "NRrNPqOxu7eY"
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(metrics, labels, test_size=0.3, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TMIF1weiMO0F"
   },
   "source": [
    "As commonly done in many Machine Learning workflows, we preprocess features to have zero mean and unit standard deviation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "9qUjNhPru6ni"
   },
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "scaler = StandardScaler()\n",
    "scaler.fit(X_train)\n",
    "\n",
    "X_train_scaled = scaler.transform(X_train)\n",
    "X_test_scaled = scaler.transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "QqaZzejRMdmu"
   },
   "source": [
    "It's now time for training a proper algorithm. We chose a support vector machine for this task"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "executionInfo": {
     "elapsed": 613,
     "status": "ok",
     "timestamp": 1610884464690,
     "user": {
      "displayName": "Aldo Marzullo",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GjBD_mZewcZ8LCqkD20Nku4DR5OCGFqYkxawoUjgg=s64",
      "userId": "17245895923239449231"
     },
     "user_tz": -60
    },
    "id": "L3A6_fh0OV9x",
    "outputId": "6297d8fe-3cc9-435b-e8fe-b50425aaee24"
   },
   "outputs": [],
   "source": [
    "from sklearn import svm\n",
    "from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score\n",
    "\n",
    "clf = svm.SVC()\n",
    "clf.fit(X_train_scaled, y_train)\n",
    "\n",
    "y_pred = clf.predict(X_test_scaled)\n",
    "\n",
    "print('Accuracy', accuracy_score(y_test,y_pred))\n",
    "print('Precision', precision_score(y_test,y_pred))\n",
    "print('Recall', recall_score(y_test,y_pred))\n",
    "print('F1-score', f1_score(y_test,y_pred))"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "authorship_tag": "ABX9TyPz5JRYvXqH50rX95tddtHr",
   "collapsed_sections": [],
   "name": "Supervised_GraphML.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
